Loading libraries¶
import matplotlib
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import numpy as np
import pandas as pd
import os, sys
import seaborn as sns
sns.set(style="ticks")
from pylab import rcParams
rcParams['figure.figsize'] = (7,5)
rcParams['figure.dpi'] = 120
rcParams['savefig.dpi'] = 120
rcParams['font.size'] = 20
rcParams['axes.facecolor'] = 'white'
matplotlib.style.use('ggplot')
import statsmodels.formula.api as smf
import statsmodels.api as sm
import missingno as msno
import sklearn
%matplotlib inline
Loading player stats data¶
players = pd.read_csv("stats.csv.gz",index_col=0)
print(players.shape)
players.head()
Loading salary data¶
raw_salary = pd.read_csv("salary.csv.gz",index_col=0)
print(raw_salary.shape)
raw_salary.head()
Only the player's name and salary are needed for this report. The salary strings also need to be converted to plain numbers.¶
# Split "NAME" at the first comma into the player's name and position
raw_salary[['full-name', 'Position']] = raw_salary['NAME'].str.split(',', n=1, expand=True)
raw_salary.head()
# Strip "$" and "," from every value, then convert to numbers (unparsable values become NaN)
raw_salary['SALARY'] = pd.to_numeric(
    raw_salary['SALARY'].astype(str).str.replace(r'[$,]', '', regex=True),
    errors='coerce')
salary = raw_salary[['full-name','SALARY']]
salary = salary.rename(columns={'full-name':'Player','SALARY':'Salary'})
print(salary.shape)
salary.head()
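The salary conversion can be checked on a few values before running it on the whole column. This is a minimal sketch: the `sample` frame and its values are made up for illustration and only assume the same "$1,234,567" style as the `SALARY` column.

```python
import pandas as pd

# Hypothetical sample values (not taken from salary.csv.gz).
sample = pd.DataFrame({"SALARY": ["$37,457,154", "$1,378,242", "Not signed", 0]})

# Cast to string first so non-string entries (such as 0) do not break .str,
# then strip "$" and "," and let to_numeric turn unparsable text into NaN.
cleaned = pd.to_numeric(
    sample["SALARY"].astype(str).str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(cleaned)
```

Because the conversion is vectorized, the result always has one value per input row, so it can be assigned back to the column without a length mismatch.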
Convert the salary to millions.¶
salary['Salary(million)'] = salary['Salary']/1000000
salary.head()
Merge the salary and player datasets.¶
df = pd.merge(salary,players,on="Player",how='outer')
print(df.shape)
df.head()
df.columns
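An outer merge keeps players that appear in only one of the two tables, which is one source of missing values later on. The sketch below uses pandas' `indicator=True` option to count such rows. The two small frames are invented stand-ins for `salary` and `players`, not the real data.

```python
import pandas as pd

# Toy frames standing in for `salary` and `players`; names and values are invented.
salary_demo = pd.DataFrame({"Player": ["A", "B", "C"], "Salary": [1.0, 2.0, 3.0]})
players_demo = pd.DataFrame({"Player": ["B", "C", "D"], "PTS": [10.5, 20.1, 5.2]})

# indicator=True adds a "_merge" column telling where each row came from.
merged = pd.merge(salary_demo, players_demo, on="Player", how="outer", indicator=True)
counts = merged["_merge"].value_counts()
print(counts)
```

Rows marked `left_only` have a salary but no stats, and rows marked `right_only` have stats but no salary.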
Dataset structure reorganization¶
merge_data = df.groupby('Player').agg({col:'max' for col in df.columns})
merge_data = merge_data.drop(['Player','Salary'], axis=1)
merge_data = merge_data.loc[:, ~merge_data.columns.str.contains('^Unnamed')]
merge_data = merge_data.rename(columns={ 'Salary(million)':'Salary','3P':'P3','3P%':'P3%','2P':'P2','2P%':'P2%','2PA':'P2A'})
# merge_data.columns
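The `groupby` with `'max'` collapses every player to a single row. The sketch below shows the effect on a toy frame, assuming a player can appear on more than one row after the merge. The values are invented.

```python
import numpy as np
import pandas as pd

# Toy data: player "A" appears twice, once without a salary.
demo = pd.DataFrame({
    "Player": ["A", "A", "B"],
    "Salary": [5.0, np.nan, 2.0],
    "PTS": [12.0, 15.0, 8.0],
})

# max() skips NaN, so each column keeps its largest non-missing value per player.
collapsed = demo.groupby("Player").max()
print(collapsed)
```

Note that the maximum is taken column by column, so the resulting row for a player can combine values that came from different original rows.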
The resulting dataset has 634 records and 29 features.¶
print(merge_data.shape)
merge_data.head()
merge_data.to_csv("raw_players_salary.csv.gz", compression='gzip')
raw_players_salary = pd.read_csv("raw_players_salary.csv.gz")
The chart shows the distribution of missing values. There are many missing values because the data come from different statistical agencies.¶
msno.matrix(raw_players_salary.sample(raw_players_salary.shape[0]))
To get a reliable dataset and an accurate model, records with missing values are dropped.¶
clear_players_salary = raw_players_salary.dropna()
# clear_players_salary_with_mean = raw_players_salary.fillna(df.mean())
# clear_players_salary = clear_players_salary_with_mean.fillna(0)
# msno.matrix(clear_players_salary.sample(clear_players_salary.shape[0]))
print(clear_players_salary.shape)
clear_players_salary.head()
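Before dropping rows it helps to see how many values are missing per column and how many rows survive. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy data with one missing salary and one missing stat.
demo = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Salary": [5.0, np.nan, 2.0, 1.0],
    "PTS": [12.0, 15.0, np.nan, 8.0],
})

missing_per_column = demo.isna().sum()   # count of NaN in each column
complete = demo.dropna()                 # keep only rows with no NaN at all

print(missing_per_column)
print(f"kept {len(complete)} of {len(demo)} rows")
```

`dropna()` removes a row if any single column is missing, so a few sparse columns can cost many records.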
There are no duplicate player records in the dataset.¶
clear_players_salary['Player'].duplicated().sum()
clear_players_salary.to_csv("clear_players_salary.csv.gz", compression='gzip')
clear_players_salary = pd.read_csv("clear_players_salary.csv.gz",index_col=0)
The summary table shows statistics such as the mean, max, and min of each feature.¶
clear_players_salary.describe()
The histogram shows that most players earn a salary of under ten million.¶
clear_players_salary['Salary'].hist()
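# sns.utils.axlabel is no longer available in recent seaborn releases; use pyplot labels instead
plt.xlabel('Salary (Million)')
plt.ylabel('Players')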
plt.title("Player Salary", alpha=0.9)
plt.figure(figsize=(22, 20))
plt.rcParams['savefig.dpi'] = 120
plt.rcParams['figure.dpi'] = 120
The density plot below further confirms that only a few star players earn a high salary.¶
# `normed` was removed from matplotlib; `density=True` normalizes the histogram instead
clear_players_salary['Salary'].hist(bins=10, histtype='bar', density=True)
clear_players_salary['Salary'].plot(kind='kde',style='k--')
plt.xlabel('SALARY')
plt.title("NBA Salary Density")
The mean salary is about 7.8 million, while the maximum salary is over 40 million.¶
print(clear_players_salary['Salary'].mean())
clear_players_salary['Salary'].max()
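When a few very large values pull the mean upward, the median and the skewness describe the distribution better than the mean alone. The sketch below uses invented salaries (in millions), not the notebook's data.

```python
import pandas as pd

# Invented, right-skewed salaries in millions: many small values, one very large.
salaries = pd.Series([1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 40.0], name="Salary")

print("mean:  ", salaries.mean())
print("median:", salaries.median())
print("skew:  ", salaries.skew())   # positive for a long right tail
```

The same three calls can be applied to `clear_players_salary['Salary']` to see how far the mean sits above the median.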
Might the player's position correlate with the salary?¶
Select the top 10 players by salary.¶
top10 = clear_players_salary.sort_values("Salary",ascending=False).head(10)
plt.bar(top10.Player,top10.Salary)
plt.ylabel("Salary")
plt.xlabel("Player" )
plt.xticks(rotation=45)
plt.title("TOP 10 NBA Salary")
Six of the top 10 players are Point Guards (PG).¶
top10['Pos']
Counting the players in each position shows that Shooting Guard (SG) has the most players. Swingman (SF-PF) has the fewest, and the position also demands a high level of skill.¶
| Position | Number of players |
|---|---|
| SG (Shooting Guard) | 105 |
| PF (Power Forward) | 77 |
| PG (Point Guard) | 71 |
| SF (Small Forward) | 68 |
| C (Center) | 63 |
| SF-PF (Swingman) | 1 |
clear_players_salary['Pos'].value_counts()
The distribution shows that no Center (C) earns more than 30 million, and that Point Guards (PG) are the most common position above that level. This matches the current state of the NBA, where Centers (C) have declined in relative importance.¶
The chart also shows that the players who earn more than 35 million are between 30 and 35 years old.¶
sns.set_style("whitegrid")
# lmplot takes `height` (formerly `size`) for the figure height
ax = sns.lmplot(x='Age', y='Salary', data=clear_players_salary,
                hue='Pos', aspect=1, height=10, scatter_kws={"s": 200})
ax.set(xlabel='AGE', ylabel='Salary',title="Salary by Position")
# plt.legend(loc='upper right', title='Pos')
plt.figure(figsize=(10,6))
The average salary of Point Guards (PG) and Centers (C) is higher than that of the other positions, and the best-paid Point Guards (PG) earn more than players at any other position.¶
plt.figure(figsize=(10,6))
ax = sns.barplot(x="Pos", y="Salary",data=clear_players_salary,capsize=.6)
ax.set(title="Salary by Position")
The positions share a similar salary distribution: in each of them, most players earn less than 10 million. We can conclude that position has some relationship with salary, but its effect is small.¶
g = sns.FacetGrid(clear_players_salary, col="Pos", height=6)
g.map(plt.hist, "Salary")
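The comparison across positions can also be read from a small summary table. This is a sketch on invented data. The column names `Pos` and `Salary` match the notebook, but the values do not come from it.

```python
import pandas as pd

# Invented salaries (in millions) for three positions.
demo = pd.DataFrame({
    "Pos": ["PG", "PG", "PG", "C", "C", "SG", "SG"],
    "Salary": [35.0, 4.0, 3.0, 12.0, 6.0, 2.0, 5.0],
})

# Count, mean and median salary per position, highest mean first.
summary = (
    demo.groupby("Pos")["Salary"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Putting the mean next to the median shows when a position's average is driven by one or two very high salaries rather than by the typical player.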
Correlation analysis of features and salary¶
clear_players_salary.columns
plt.figure(figsize=(25, 12))
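# numeric_only=True skips text columns such as Player and Pos (required in recent pandas)
sns.heatmap(clear_players_salary.corr(numeric_only=True), annot=True, fmt=".2f", cmap=plt.cm.Greens)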
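There are many features in the dataset. I select the 11 features that correlate most strongly with salary, which gives 12 columns including Salary itself.¶
| Feature | Description | Note |
|---|---|---|
| MP | Minutes Played | Minutes played per game |
| FG | Field Goals | Field goals made per game |
| FGA | Field Goal Attempts | Field goal attempts per game |
| 2PA (column `P2A`) | 2-Point Field Goal Attempts | 2-point attempts per game (2-point chances outnumber 3-point chances) |
| 2P (column `P2`) | 2-Point Field Goals | 2-point field goals made per game |
| FT | Free Throws | Free throws made |
| FTA | Free Throws Attempted | Free throw attempts per game |
| DRB | Defensive Rebounds | Defensive rebounds per game (only defensive rebounds are related to salary) |
| AST | Assists | Assists per game (assists matter as well) |
| TOV | Turnovers | Number of turnovers |
| PTS | Points | Points per game (the most important feature; its 0.63 is the highest correlation) |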
selected_df = clear_players_salary[['Salary','MP','FG','FGA','P2A','P2','FT','FTA','DRB','AST','TOV','PTS']]
print(selected_df.shape)
selected_df.head()
The figure suggests that no single feature has a strong effect on salary by itself. Free Throws (FT) and Points (PTS) have the highest correlation with salary among all features.¶
plt.figure(figsize=(25, 12))
sns.heatmap(selected_df.corr(), annot=True, fmt=".2f", cmap=plt.cm.Greens)
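Instead of reading the values off the heatmap, the features can be ranked by their correlation with salary. A minimal sketch on invented numbers:

```python
import pandas as pd

# Invented data: PTS rises with Salary, TOV shows no clear trend.
demo = pd.DataFrame({
    "Salary": [1.0, 2.0, 3.0, 4.0, 5.0],
    "PTS": [2.0, 4.1, 5.9, 8.2, 9.8],
    "TOV": [3.0, 1.0, 2.5, 1.5, 2.0],
})

# Take the Salary column of the correlation matrix, drop the self-correlation,
# and sort so the most strongly (positively) correlated feature comes first.
ranking = demo.corr()["Salary"].drop("Salary").sort_values(ascending=False)
print(ranking)
```

Applied to `selected_df`, the same one-liner lists the features in the order they appear along the Salary row of the heatmap.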
The scatter matrix shows the relationships in the data more intuitively. However, it is still quite hard to see how salary relates to the other features.¶
_ = pd.plotting.scatter_matrix(selected_df, figsize=(20, 15), diagonal='hist')
selected_df.cov()
No single feature determines player salary. It is more likely determined by a combination of features.¶
selected_df.to_csv("fmanual_feature_Select_players_salary.csv.gz", compression='gzip')
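The idea that several features together explain salary better than one can be illustrated with a least-squares fit. The sketch below is not the notebook's model: it uses synthetic data and plain NumPy, and the names `pts`, `ast` and the helper `r_squared` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two features; not the notebook's data.
pts = rng.normal(10, 4, n)
ast = rng.normal(3, 1.5, n)
# Synthetic salary that depends on both features plus noise.
salary = 0.5 * pts + 1.0 * ast + rng.normal(0, 2, n)

def r_squared(features, target):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    return 1 - residuals.var() / target.var()

r2_single = r_squared([pts], salary)
r2_multi = r_squared([pts, ast], salary)
print(f"PTS only: {r2_single:.2f}, PTS + AST: {r2_multi:.2f}")
```

Adding a feature can never lower the in-sample R², so a real comparison should use held-out data or an adjusted score.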